Feature Selection for Record Data

The foremost aim of the Feature Selection module within this project is to meticulously pinpoint and select the most pertinent and informative features (variables or attributes) present in the dataset, tailored specifically for the task at hand. By doing so, we can significantly elevate the performance of the model, mitigate overfitting, and bolster the clarity and comprehensibility of the outcomes. In this section, we focus on labeled record data. The results of this section suggest the best model to use for analysis and for forecasting future performance with training and test sets.

Code
library(tidyverse)
library(tidyquant)
library(ggplot2)
library(forecast)
library(astsa) 
library(xts)
library(tseries)
library(lubridate)
library(plotly)
library(dplyr)

#load df
ihe_df <- read.csv("cleaned_data/IHE.csv")

ihe_df$Date = as.Date(ihe_df$Date)

Train and Test Model

Initially, we will partition the dataset into training and testing subsets. Following that, we will train the model using the auto.arima() function, setting the stage for the subsequent prediction step. The trained model will then be evaluated against the test set by computing benchmark methods and error metrics for comparison.

Code
ihe.ts = subset(ihe_df, select = Adj.Close)

ihe.ts = ts(ihe.ts, start=c(2019,1),frequency = 365.25) #per day for stock market price

ihe.diff = diff(ihe.ts)

train <- ts(ihe.diff[1:799])     # roughly the first 80% of the differenced series
test  <- ts(ihe.diff[800:1005])  # the remaining ~20%, held out for evaluation

fit = auto.arima(train, seasonal = FALSE)
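
Before forecasting, it can help to inspect which model auto.arima() actually selected (a quick sketch reusing `fit` from above):

```r
summary(fit)         # coefficients, sigma^2, and AIC/BIC of the selected model
arimaorder(fit)      # the chosen (p, d, q) as a named vector
checkresiduals(fit)  # residual diagnostics, including a Ljung-Box test
```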

Forecast

We have constructed two distinct ARIMA models, utilizing the values of p, q, and d derived from prior analyses. This approach enables us to forecast future stock market values by comparing the performance of the two models, ultimately allowing us to ascertain which model delivers the most accurate predictions.

Code
arimaModel_1 <- arima(ihe.diff, order = c(1,1,1))
arimaModel_2 <- arima(ihe.diff, order = c(1,1,0))

forecast1 <- predict(arimaModel_1, n.ahead = length(test))
forecast2 <- predict(arimaModel_2, n.ahead = length(test))

#ts to df
ts_df <- data.frame(date = time(ihe.diff), value = as.numeric(ihe.diff))
train_df <- data.frame(date = time(ihe.diff)[1:799], value = as.numeric(ihe.diff)[1:799])
forecast1_df <- data.frame(date = time(ihe.diff)[800:1005], value = forecast1$pred)
forecast2_df <- data.frame(date = time(ihe.diff)[800:1005], value = forecast2$pred)

#plot forecast
ggplotly(ggplot() +
    geom_line(data = train_df, aes(x = date, y = value,
              color = "Train Values"), linetype = "solid", alpha = 0.6) +
    geom_line(data = forecast1_df, aes(x = date, y = value, color = "ARIMA(1,1,1)"),
              linetype = "solid") +
    geom_line(data = forecast2_df, aes(x = date, y = value, color = "ARIMA(1,1,0)"),
              linetype = "solid") +
    labs(x = "Date", y = "IHE Stock Price", title = "Forecasting ARIMA(1,1,1) and ARIMA(1,1,0)") +
    theme_minimal() +
    # names in the manual scale must match the aes(color = ...) labels above
    scale_color_manual(name = "Forecast",
                       values = c("Train Values" = "grey60",
                                  "ARIMA(1,1,1)" = "lightblue",
                                  "ARIMA(1,1,0)" = "lightpink")))

As you can observe, the ARIMA(1,1,1) model outperforms the ARIMA(1,1,0) model. Consequently, our attention will now shift to concentrating on the ARIMA(1,1,1) model. We will proceed to run predictions using a more advanced visualization tool, incorporating both differenced data and drift to enhance the accuracy and interpretability of our results.

Code
ihe.diff %>%
  Arima(order = c(1,1,1), include.drift = TRUE) %>%
  forecast() %>%
  autoplot() +
  ylab("IHE Stock Market Prediction") +
  theme_minimal()

The prediction plot provides a visual representation of the confidence intervals and potential future scenarios for stock market prices.

We can now narrow our focus to a smaller prediction window in order to observe the differences and assess how the prediction forecast performs in comparison to benchmark methods.

Window for Forecast (2020-2021)

Code
ihe.diff2 = window(ihe.diff, start=2020, end=2021)
autoplot(ihe.diff2) +
  ggtitle("IHE Stock Market Cost 2020-2021 (1st-order differenced)") +
  xlab("Year") + ylab("IHE Stock Market Cost (Differenced)") +
  theme_minimal()

The selection of the time window between 2020 and 2021 is deliberate, owing to the outbreak of COVID-19 in 2020, which exerted a profound influence on the entire community. The pandemic precipitated significant shifts, altering stock prices and fundamentally transforming our societal landscape.

Benchmark Methods - Metrics

Here, we can employ both the mean and the naive methods as baseline forecasting approaches, providing a comparative analysis alongside the more sophisticated ARIMA model. This comparison allows us to better assess the performance and suitability of each method in predicting future data points.

Code
ihe.mean = meanf(ihe.diff2, h=10)
checkresiduals(ihe.mean)

    Ljung-Box test

data:  Residuals from Mean
Q* = 66.997, df = 73, p-value = 0.6756

Model df: 0.   Total lags used: 73

Code
ihe.naive = naive(ihe.diff2, h=10)
checkresiduals(ihe.naive)

    Ljung-Box test

data:  Residuals from Naive method
Q* = 191.2, df = 73, p-value = 1.564e-12

Model df: 0.   Total lags used: 73
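
For reference, the Q* statistic that checkresiduals() reports comes from the Ljung-Box test, which can also be run directly with base R's Box.test() (a sketch reusing the naive model's residuals from above):

```r
res <- na.omit(residuals(ihe.naive))         # drop the leading NA from the naive method
Box.test(res, lag = 73, type = "Ljung-Box")  # should reproduce the Q* value above
```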

Code
autoplot(ihe.diff) +
  autolayer(meanf(ihe.diff, h = 10), series = "Mean", PI = FALSE) +
  autolayer(naive(ihe.diff, h = 3), series = "Naive", PI = FALSE) +
  autolayer(forecast(Arima(ihe.diff, order = c(1,1,1), include.drift = TRUE)),
            series = "ARIMA(1,1,1)", PI = FALSE) +
  ggtitle("Forecasting IHE Stock Market Price") +
  xlab("Year") + ylab("Stock Price (Differenced)") +
  guides(colour = guide_legend(title = "3 Yr Forecast")) +
  theme_minimal()

The displayed plot illustrates forecasts extending three years into the future, utilizing ARIMA, mean, and naive methods for prediction. Evidently, the ARIMA model significantly outperforms the mean and naive methods, showcasing a higher level of accuracy and reliability in its predictions. Given these observations, it is reasonable to deduce from the forecast plot that the ARIMA model stands out as the superior choice, prompting us to proceed with it as the preferred model for our forecasting endeavors.

This assertion can be substantiated by the benchmark tables, which compare accuracy between the training and test sets using Mean Error (ME), Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Mean Absolute Scaled Error (MASE), and the autocorrelation of the residuals at lag 1 (ACF1). While accuracy() computes these metrics automatically, we will also discuss the evaluation metrics used to gauge a Naive Bayes classifier's performance (accuracy, precision, recall, and the F1 score), ensuring a thorough understanding and robust assessment of the model's capabilities.

Code
pred <- forecast(fit, h = length(test))  # forecast over the test horizon
accuracy(pred)
f1 <- meanf(train, h = length(test))
accuracy(f1)
f2 <- naive(train, h = length(test))
accuracy(f2)
f3 <- rwf(train, drift = TRUE, h = length(test))
accuracy(f3)

# calculate MAE and MSE on the test set by hand
# (MAE is the mean of the absolute errors, not the absolute value of the mean error)
mae1  <- mean(abs(as.numeric(pred$mean) - as.numeric(test)))
mae11 <- mean(abs(as.numeric(f1$mean) - as.numeric(test)))
mae12 <- mean(abs(as.numeric(f2$mean) - as.numeric(test)))
mae13 <- mean(abs(as.numeric(f3$mean) - as.numeric(test)))

mse1  <- mean((as.numeric(pred$mean) - as.numeric(test))^2)
mse11 <- mean((as.numeric(f1$mean) - as.numeric(test))^2)
mse12 <- mean((as.numeric(f2$mean) - as.numeric(test))^2)
mse13 <- mean((as.numeric(f3$mean) - as.numeric(test))^2)

df <- data.frame(
  Model = c("Arima", "Mean Forecast", "Naive", "Random Walk Forecast"),
  MAE = c(mae1, mae11, mae12, mae13),
  MSE = c(mse1, mse11, mse12, mse13)
) %>% mutate_if(is.numeric, round, 3)
print(df)
# accuracy(pred) — ARIMA
                       ME     RMSE      MAE  MPE MAPE      MASE       ACF1
Training set    0.1068385 1.927508 1.443424  Inf  Inf 0.6684412 0.02125072

# accuracy(f1) — Mean Forecast
                       ME     RMSE      MAE  MPE MAPE      MASE       ACF1
Training set 1.974266e-17 1.990964 1.463855 -Inf  Inf 0.6779027 -0.1309626

# accuracy(f2) — Naive
                       ME     RMSE      MAE  MPE MAPE      MASE       ACF1
Training set  -0.00222457 2.995515 2.159388 -Inf  Inf 1.0000000 -0.6000132

# accuracy(f3) — Random Walk with Drift
                       ME     RMSE      MAE  MPE MAPE      MASE       ACF1
Training set 6.904112e-17 2.995514 2.159349 -Inf  Inf 0.9999819 -0.6000132
                 Model   MAE   MSE
1                Arima 0.031 2.384
2        Mean Forecast 0.102 2.394
3                Naive 1.739 5.409
4 Random Walk Forecast 2.103 6.844

Upon scrutinizing the benchmark methods table, it becomes evident that the ARIMA model outperforms the rest, boasting the lowest Mean Absolute Error (MAE) and Mean Squared Error (MSE) values. This unequivocally positions the ARIMA model as the most suitable and accurate for this particular analysis. Conversely, the Random Walk Forecast method registers the highest MAE and MSE values, starkly highlighting its inefficacy and rendering it an unsuitable choice for reliable predictions in this context.

Evaluation Metrics for Naive Bayes Classifier

As previously discussed, in benchmarking methods, we employ evaluation metrics such as accuracy to gauge the effectiveness of the Naive Bayes model and to facilitate comparisons with other models.

Accuracy is computed as the proportion of correctly predicted instances out of the entire dataset. In R, this can be calculated using the accuracy() function. Although accuracy serves as a useful initial metric, it may not be adequately informative in scenarios involving imbalanced datasets, where the prevalence of one class substantially exceeds the other.

Precision, also referred to as positive predictive value, is another vital metric. It represents the fraction of true positive predictions relative to the sum of all positive predictions made. To calculate precision in R, one can use the precision() function.

Recall, synonymous with sensitivity, denotes the proportion of true positive predictions out of all actual positive instances present in the dataset. It is indicative of the model’s proficiency in identifying positive instances. The sensitivity() function in R can be employed for this calculation.

The F1 score embodies the harmonic mean of precision and recall, thus incorporating both false positives and false negatives into its calculation. Particularly useful in the context of imbalanced class distributions, the F1 score ranges between 0 and 1, with 1 representing perfection and 0 denoting the worst possible score. In R, this score can be computed using the F_meas() function, providing a comprehensive and balanced performance evaluation.
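
As a concrete illustration of these four metrics, here is a small base-R sketch on hypothetical "Up"/"Down" predictions (the labels and values are made up purely for illustration, with "Up" treated as the positive class):

```r
# Hypothetical actual and predicted labels, for illustration only
actual    <- factor(c("Up","Up","Down","Up","Down","Down","Up","Down"), levels = c("Up","Down"))
predicted <- factor(c("Up","Down","Down","Up","Down","Up","Up","Down"), levels = c("Up","Down"))

cm <- table(Predicted = predicted, Actual = actual)  # 2x2 confusion matrix
TP <- cm["Up", "Up"]     # predicted Up, actually Up
FP <- cm["Up", "Down"]   # predicted Up, actually Down
FN <- cm["Down", "Up"]   # predicted Down, actually Up

acc  <- sum(diag(cm)) / sum(cm)        # accuracy: correct / total
prec <- TP / (TP + FP)                 # precision (positive predictive value)
rec  <- TP / (TP + FN)                 # recall (sensitivity)
f1   <- 2 * prec * rec / (prec + rec)  # F1: harmonic mean of precision and recall
c(accuracy = acc, precision = prec, recall = rec, F1 = f1)
```

The precision(), sensitivity(), and F_meas() functions mentioned above (provided by the caret package) compute the same quantities directly from the two factors.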

Overfitting and Underfitting

Overfitting occurs when a model excessively attunes itself to the training data, to the extent of assimilating noise as though it were a legitimate pattern. This phenomenon not only results in a skewed understanding of the data but also compromises the model’s ability to generalize, leading to biased and inaccurate predictions when applied to new, unseen data. A practical approach to diagnosing overfitting involves contrasting the model’s performance on the training dataset with its performance on a validation or test set, providing insight into whether the model is performing optimally or if it has succumbed to overfitting.

Underfitting occurs when a model is overly simplistic, failing to capture the underlying complexities and patterns within the data. This condition results in biased and inaccurate outcomes, as the model lacks the necessary intricacies to make precise predictions. To alleviate underfitting, one can introduce complexity to the model, integrate additional features, or engage in feature engineering, enhancing the model’s ability to understand and interpret the data accurately.

To determine whether your Naive Bayes model is prone to overfitting or underfitting, a straightforward strategy involves evaluating and contrasting its performance on the training dataset against that on the validation or test dataset. If the model demonstrates stellar performance on the training data but falters significantly on the test data, it is indicative of overfitting. Conversely, consistent poor performance across both training and test datasets is a hallmark of underfitting, signaling that the model is struggling to capture the underlying patterns in the data.
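
For the ARIMA model above, this train-versus-test comparison can be read directly from forecast::accuracy() when the held-out values are passed as the second argument (a sketch reusing `fit` and `test` from the earlier split):

```r
pred_check <- forecast(fit, h = length(test))
accuracy(pred_check, test)  # returns a "Training set" row and a "Test set" row;
                            # a test RMSE far above the training RMSE suggests overfitting
```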

Conclusion

To effectively utilize Naive Bayes for categorizing stock trends in R, one must initiate by delineating the problem as a binary classification task, aimed at anticipating whether stock prices will rise or fall. This entails the meticulous construction of features from historical pricing, market behavior indicators, and trading volumes that the algorithm can utilize for prediction. Subsequent to the creation of binary labels reflecting price fluctuations, and ensuring the independence and numerical nature of the features, the dataset must be split into training and test groups in a manner that honors the chronological sequence of the data. Selecting a Naive Bayes model that matches the data’s distribution is critical, followed by training, thorough evaluation through metrics, and judicious application for forecasts. Considering the intricacies and unpredictable nature of financial data, it is essential to bolster the Naive Bayes predictions with further analytical insights. Despite Naive Bayes providing a solid starting point, the intricacies of financial forecasting often require more advanced models that can account for complex interdependencies not assumed by Naive Bayes.
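
The workflow outlined in this conclusion might look as follows in R. This is only a sketch: the e1071 package's naiveBayes(), the lagged-return features, and the 80/20 chronological split are all illustrative assumptions, not part of the analysis above.

```r
library(e1071)  # naiveBayes(); the package choice is an assumption

# Binary labels from the differenced series: did the price move up that day?
ret <- as.numeric(ihe.diff)
nb_df <- na.omit(data.frame(
  label = factor(ifelse(ret > 0, "Up", "Down")),
  lag1  = dplyr::lag(ret, 1),  # yesterday's move
  lag2  = dplyr::lag(ret, 2)   # the move two days back
))

# Chronological split: no shuffling, honoring the time order of the data
cut     <- floor(0.8 * nrow(nb_df))
nb_fit  <- naiveBayes(label ~ lag1 + lag2, data = nb_df[1:cut, ])
nb_pred <- predict(nb_fit, nb_df[(cut + 1):nrow(nb_df), ])
mean(nb_pred == nb_df$label[(cut + 1):nrow(nb_df)])  # test-set accuracy
```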